
The Importance of Data in AI and Machine Learning, and How to Deal with Missing Data





Introduction

We are now surrounded by many types of data. Humans have long used data, consciously or unconsciously, to drive decisions ranging from purchasing to artificial intelligence. Moreover, much current scientific research points towards greater data utilisation, which has resulted in widespread integration of data and computing.

We are currently flooded with data. Advances in data collection have become a major driving force behind much research. Artificial intelligence and machine learning in particular are two of the most effective methods for helping humans solve complicated problems through data analysis. Big data is the foundation regardless of which tools are used to analyse it (Machine Learning & Artificial Intelligence - Machine Learning).




Because real-world datasets are partial (data can come from different sources such as sensors and surveys), inconsistent, erroneous, and frequently contain missing values, the first stage of data analysis is data preparation. ML analyses data using algorithms and models, and its performance depends entirely on the incoming data: because all algorithms and models rely on input data, missing or distorted data can entirely change the outcome (Sarker, 2021). Data that has been prepared and organised is usually easy to work with, making analysis easier. Incorrectly prepared data, on the other hand, can make analysis difficult or impossible. In addition, gathering data from a variety of sources, and keeping that data current, are critical components of data preparation. Data preparation is said to account for up to 70% of the time and effort spent on data-analysis initiatives (M. Lou, "Preprocessing Data for Neural Networks", Technical Analysis of Stocks & Commodities Magazine, Oct. 1993).


As a result, tidying and cleaning the data ensures quick, cost-effective data mining with high-quality analysis results. Data processing entails a wide range of tasks; in this essay, I'll go over some of its basic ideas.


Steps in Data processing in Machine Learning:


Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data (Kuhn and Johnson, 2013).



#1 - Data Aggregation


Data aggregation, also known as data summarization, is the first step in data processing: deriving the right level of detail for data mining. Data recorded at full detail may be too large to handle within the time constraints, and not every detail may be of interest to the investigation.

Many online resources let you find and download datasets; one of the most useful is Google Dataset Search, a search engine for locating datasets of interest across the web.

After acquiring the data, it must be saved in a suitable format, such as a CSV, HTML, or XLSX file.


#2 - Missing Value and Data Cleaning


Missing data can be caused by a variety of circumstances, including a failing sensor, a blank survey entry, or negligence by the person entering the data. It is therefore critical to pick the best technique for coping with this prevalent problem.


Importing our tools is the initial step towards data cleaning.

- NumPy is a foundational package for a wide range of mathematical operations, from standard trigonometric and arithmetic functions to complex-number handling.


- Pandas is a fantastic Python data processing and analysis tool. This open-source library is one of the most powerful tools for organizing and importing data collections. It provides Python with high-performance, easy-to-use data structures and data analysis capabilities.

- Matplotlib is a 2D Python plotting library that can be used to create any style of chart. It can produce high-quality figures for print and for interactive settings on a variety of devices.
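The three imports above can be sketched as follows (a minimal example, assuming NumPy, Pandas, and Matplotlib are installed, e.g. via pip):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Quick sanity check: build a tiny DataFrame from a NumPy array.
arr = np.array([[1.0, 2.0], [3.0, np.nan]])
df = pd.DataFrame(arr, columns=["a", "b"])
print(df.isnull().sum().sum())  # 1 missing value
```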



1- Discard the bad record!

Dropping any record with a missing value or an incorrect field is common practice, but the quality of the remaining data should be considered: it's important to know how much of the dataset consists of complete records.

In some domains, for example, more than 60% of records contain at least one missing value. Moreover, the records with missing values may be the most interesting ones. So, if you're going to use this strategy, be cautious.



- The first step is to import our data with Pandas.
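The original post shows this step as a screenshot. A minimal sketch, using an inline CSV string with hypothetical values to stand in for the real file (the PID and OWN_OCCUPIED columns referenced later in the post are assumed):

```python
import io
import pandas as pd

# Inline CSV standing in for the real dataset file (hypothetical values).
csv_data = """PID,ST_NUM,OWN_OCCUPIED,NUM_BEDROOMS
100001,104,Y,3
100002,197,N,
100003,,Y,NA
"""

# With a real file this would be pd.read_csv("dataset.csv").
df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)  # (3, 4)
```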


- We can view the summary of our datasets using some basic code.
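A sketch of the basic summary calls (hypothetical data for illustration):

```python
import io
import pandas as pd

csv_data = "PID,ST_NUM,NUM_BEDROOMS\n100001,104,3\n100002,197,2\n100003,198,1\n"
df = pd.read_csv(io.StringIO(csv_data))

print(df.head())      # the first rows of the dataset
print(df.dtypes)      # the inferred type of each column
print(df.describe())  # basic statistics for the numeric columns
```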


- Identifying and handling the missing values


As I mentioned in the opening paragraph, missing values come in various forms; Pandas recognizes some of them but not others.

1- Standard missing value:

A standard missing value is a blank cell or the marker NA, both of which Pandas recognizes as null values.
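A small sketch showing that both a blank cell and "NA" are read in as NaN (hypothetical data):

```python
import io
import pandas as pd

# A column containing a blank cell and the marker "NA".
csv_data = "PID,ST_NUM\n1,104\n2,\n3,NA\n"
df = pd.read_csv(io.StringIO(csv_data))

# Both the blank cell and "NA" come in as NaN.
print(df["ST_NUM"].isnull())
```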






2- Non-Standard missing value:

A non-standard missing value is a marker of a different type, such as "na", which Pandas is unable to recognize on its own.



- This is a common problem when numerous users are manually entering data: perhaps you like "n/a" while others prefer "na". A simple approach to detect these different formats is to put them in a list; when we import the data, Pandas will then recognize them immediately.
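A sketch of that approach, passing the list to `read_csv` via the `na_values` parameter ("na" and "--" are not in Pandas' default list of missing-value markers, so without this they would stay as strings):

```python
import io
import pandas as pd

# Extra missing-value markers that Pandas does not recognize by default.
missing_values = ["n/a", "na", "--"]

csv_data = "NUM_BEDROOMS\n3\nna\n2\n--\n"
df = pd.read_csv(io.StringIO(csv_data), na_values=missing_values)

print(df["NUM_BEDROOMS"].isnull().sum())  # 2
```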





- Unexpected values


In this case, we may encounter a value in the wrong place, such as a number in a column that should contain a Yes or No response.

It is preferable to consider various strategies here; one tactic is to loop through each element in the column. If an entry can be converted to an integer, it does not belong in a Yes/No column, so we convert it to a missing value.
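That loop can be sketched as follows (a Yes/No column with a stray number, hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "N", "12", "Y"]})

# If an entry can be converted to an integer, it does not belong in a
# Yes/No column, so we replace it with a missing value.
for i, entry in enumerate(df["OWN_OCCUPIED"]):
    try:
        int(entry)
        df.loc[i, "OWN_OCCUPIED"] = np.nan
    except ValueError:
        pass

print(df["OWN_OCCUPIED"].isnull().sum())  # 1
```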



- Deleting row or column



Having identified the various categories of missing data, we can begin cleaning. Deleting a row or column with Pandas is one technique for getting rid of missing data.




- Here we drop the “PID” column from the dataset with the drop function:


- We can do the same to delete a row.
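Both deletions can be sketched like this (hypothetical data; `dropna` removes every row containing at least one missing value):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PID": [100001, 100002, 100003],
    "ST_NUM": [104.0, np.nan, 197.0],
})

# Drop the "PID" column.
no_pid = df.drop(columns=["PID"])

# Drop every row that contains at least one missing value.
no_missing = df.dropna()

print(no_pid.columns.tolist())  # ['ST_NUM']
print(len(no_missing))          # 2
```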



#3 - Replacing the average value


- A simple and common way to deal with missing or erroneous data is to replace it with the mean, the median, or a specific number. The advantage is that this minimally affects the overall statistics for that variable.


- Another option is to assign a value based on the nearest neighbours: finding the record most similar to the one in question gives a better approximation of the substitute value. The idea is that if record A has a missing entry for variable x, and record B is the record closest to A (ignoring variable x), then the value of variable x in record B is used to fill in the blank.


Whichever method we choose for filling in missing data, Pandas is our tool.
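A sketch of mean imputation with Pandas (hypothetical data; the median would work the same way via `.median()`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NUM_BEDROOMS": [3.0, np.nan, 2.0, np.nan, 4.0]})

# Replace missing values with the column mean (NaN is ignored by .mean()).
mean_val = df["NUM_BEDROOMS"].mean()  # 3.0
df["NUM_BEDROOMS"] = df["NUM_BEDROOMS"].fillna(mean_val)

print(df["NUM_BEDROOMS"].tolist())  # [3.0, 3.0, 2.0, 3.0, 4.0]
```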


#4 - Encoding Categorical Data

Categorical data should be converted to numerical form, because machine-learning algorithms only understand numerical variables.
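One common way to do this in Pandas is one-hot encoding with `get_dummies` (a sketch on hypothetical data; the original post's screenshot may have used a different encoder):

```python
import pandas as pd

df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "N", "Y"]})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["OWN_OCCUPIED"])

print(encoded.columns.tolist())  # ['OWN_OCCUPIED_N', 'OWN_OCCUPIED_Y']
```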


Conclusion

I hope you found this post informative and useful. For anyone who wants a decent starting point for preparing datasets for machine-learning algorithms, I have presented an initial, simple walkthrough. Please help me improve with your kind comments and follows. Thank you very much.
