Data Analysis For Machine Learning with Python (part-2)

21/10/19   20 minutes read     491 Naren Allam

In this article, you will learn in practice how to prepare data for analysis, the basics of data analysis with Python packages, how to create meaningful data visualizations, how to predict future trends from data, and much more.
For the theory behind data analysis, see the Python Guru article
Data Analysis For Machine Learning with Python (part-1)

1.Data Requirement
2.Data Collection
3.Data processing
4.Exploratory Data Analysis
4.1.Data Profiling
4.2.Data Cleaning
4.3.Data Visualization

1.Data Requirement
For the data analysis in this article, the data requirement is a product requirement. The product here is mobile phones, so domain knowledge about mobiles is needed to understand their features.

How does domain knowledge influence data Analysis?
You may have studied data science and machine learning and used algorithms such as regression or classification to make predictions on test data. But the true power of an algorithm and its data can be harnessed only when we have some domain knowledge. A model's accuracy often improves when that knowledge is applied to the data.
For example, knowledge of the mobile industry can be applied when working with mobile data. Let's say we have 15 features in this data set; domain knowledge helps us understand what each feature means.

Where is domain knowledge useful?
Domain knowledge is most useful in feature engineering. Feature engineering means creating features with the help of domain knowledge so that machine learning algorithms perform better.
Feature engineering guided by domain knowledge can give a better accuracy score and a lower RMSE.
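
As a rough illustration of feature engineering with domain knowledge, the sketch below turns free-text specification strings into numeric features and derives a new one. The column names (`ram`, `storage`, `price`), the helper `leading_number` and the sample values are hypothetical and are not taken from the article's data set.

```python
import pandas as pd

# Hypothetical sample rows; the real scraped columns may differ.
phones = pd.DataFrame({
    "ram": ["4 GB", "6 GB", "8 GB"],
    "storage": ["64 GB", "128 GB", "128 GB"],
    "price": [9999, 14999, 21999],
})


def leading_number(series: pd.Series) -> pd.Series:
    """Extract the first number from strings such as '4 GB' as a float."""
    return series.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)


# Numeric versions of the text columns
phones["ram_gb"] = leading_number(phones["ram"])
phones["storage_gb"] = leading_number(phones["storage"])

# Domain-driven feature: price paid per GB of storage
phones["price_per_gb"] = phones["price"] / phones["storage_gb"]

print(phones[["ram_gb", "storage_gb", "price_per_gb"]])
```

Knowing that RAM and storage are quoted in GB is what makes a feature such as `price_per_gb` meaningful; without that knowledge the columns would stay as opaque strings.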

2.Data Collection
The data was collected through web scraping, which yields unstructured data. An e-commerce site (Flipkart) was scraped for mobile specifications, and the collected mobiles data was stored in a CSV file for data processing.
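
The general shape of such a scraping step can be sketched as follows. This is not the scraper used for this article: the HTML snippet, the tag and class names, and the field names are all hypothetical, and a real product page will be structured differently. The sketch parses an inline string instead of making a network request, so it runs offline.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup; a real product page uses different tags and classes.
html = """
<div class="product"><span class="name">Phone A</span><span class="ram">4 GB</span></div>
<div class="product"><span class="name">Phone B</span><span class="ram">6 GB</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect one dictionary per product block
rows = []
for product in soup.find_all("div", class_="product"):
    rows.append({
        "name": product.find("span", class_="name").get_text(strip=True),
        "ram": product.find("span", class_="ram").get_text(strip=True),
    })

# Write CSV to an in-memory buffer; use open("mobiles_data.csv", "w", newline="")
# instead of io.StringIO() to save a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "ram"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```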

3.Data processing
Data processing is the method of formatting information so that it is easy to access and use for analysis. It is the process of organizing and controlling a large amount of data. It helps to handle the important data in a consistent manner and to analyse each and every feature during Exploratory Data Analysis (EDA), which covers data profiling, data cleaning (or data pre-processing) and data visualization.

4.Exploratory Data Analysis
4.1.Data Profiling
4.2.Data Cleaning
4.3.Data Visualization
Let's start a practical example that follows the steps discussed above.

Importing packages

# Python packages for data analysis
import pandas as pd
import numpy as np
import pandas_profiling as pp
from IPython.display import display  # display() is predefined only inside Jupyter/IPython

# This mobiles data was collected by web scraping with BeautifulSoup.
# Loading the data set with pandas.
df = pd.read_csv('mobiles_data.csv')
display(df.shape, df.head())

# profile_report() is added to DataFrames by pandas_profiling.
df.profile_report()

Output for the mobiles data (shown in the image below)
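
If pandas_profiling is not available, a similar but much smaller overview can be approximated with plain pandas. The sketch below is an assumption-laden stand-in, not the report pandas_profiling produces: `quick_profile` is a hypothetical helper, and the sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def quick_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, missing share (%), distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(1),
        "unique": frame.nunique(),  # nunique() ignores missing values
    })


# Hypothetical sample; replace with the DataFrame loaded from the CSV file.
sample = pd.DataFrame({
    "Brand": ["A", "B", "B", None],
    "Price": [9999.0, 14999.0, 14999.0, np.nan],
})

summary = quick_profile(sample)
print(summary)
print("duplicate rows:", sample.duplicated().sum())
```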

Next, you will see how to identify and remove duplicate values from the dataset.
Before dropping duplicates, we need to find out which records are exact duplicates.
This mobiles data set contains duplicate records, which we drop using pandas as shown in the code below. With pandas_profiling we can also easily identify which columns have NA (missing) values and duplicate values.

# Identifying duplicate records in the data
print(df.duplicated(subset=None, keep='first').sum())

# Dropping the duplicates
df1 = df.drop_duplicates(keep='first', inplace=False)
display(df.shape,   # df is the complete data set
        df1.shape)  # df1 is the data set after dropping duplicates


This data set has 290 duplicate records, which are removed with drop_duplicates.
After dropping the duplicates, we have 4060 records with 15 features (columns).
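
To look at the duplicated records themselves before dropping them, `duplicated(keep=False)` can be used as a row filter. A minimal sketch on a hypothetical four-row sample (the column names and values are made up):

```python
import pandas as pd

# Hypothetical sample with one repeated record
sample = pd.DataFrame({
    "model": ["Phone A", "Phone B", "Phone A", "Phone C"],
    "ram": ["4 GB", "6 GB", "4 GB", "4 GB"],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = sample[sample.duplicated(keep=False)]
print(all_copies)

# keep='first' marks only the later repeats, which are the rows drop_duplicates removes
repeats = sample.duplicated(keep="first").sum()
deduped = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(repeats, deduped.shape)
```

With `keep='first'` the count of marked rows equals the number of rows that `drop_duplicates` removes, so the two shapes differ by exactly that count.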

# Checking for NAs (missing values) in each column
display(df1.shape, df1.isnull().sum())

# True if any column contains missing values
display(df1.isnull().sum().any(), df1.shape, df1.head(2))


The colors column has 2 missing values and the Display column has 78 (shown in the image below). If any column has a lot of missing values, we can drop that column.
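
One way to act on missing values is sketched below: drop columns whose share of missing values exceeds a threshold, then drop the remaining rows that still contain NAs. The 50% threshold, the sample data and the variable names are assumptions for illustration, not choices made in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'display' is mostly empty, 'colors' has one gap
sample = pd.DataFrame({
    "model": ["Phone A", "Phone B", "Phone C", "Phone D"],
    "colors": ["black", None, "blue", "red"],
    "display": [np.nan, np.nan, np.nan, "6.1 inch"],
})

MAX_MISSING_SHARE = 0.5  # assumed threshold; tune it for the data at hand

# Share of missing values per column (0.0 to 1.0)
missing_share = sample.isnull().mean()

# Drop columns that are missing more often than the threshold allows
sparse_columns = missing_share[missing_share > MAX_MISSING_SHARE].index
trimmed = sample.drop(columns=sparse_columns)

# Drop the remaining rows that still have a missing value
cleaned = trimmed.dropna().reset_index(drop=True)

print(list(sparse_columns), cleaned.shape)
```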

Converting the column names and the data to lower case.

# Column names to lower case
df1.columns = df1.columns.str.lower()

# Data to lower case
# Note: astype(str) turns every value into a string, including numbers and
# missing values (NaN becomes 'nan').
df2 = df1.apply(lambda x: x.astype(str).str.lower())
display(df2.shape, df2.head(2))
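
Because `astype(str)` converts every column to text, numeric columns and missing values lose their original type. If that matters for later steps, a possible alternative is to lower-case only the values that are already strings. This is a sketch on hypothetical data, and `lower_if_text` is a made-up helper name:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a text column, a numeric column and missing values
sample = pd.DataFrame({
    "Brand": ["Samsung", "APPLE", None],
    "Price": [9999.0, 14999.0, np.nan],
})
sample.columns = sample.columns.str.lower()


def lower_if_text(value):
    """Lower-case strings; return every other value unchanged."""
    return value.lower() if isinstance(value, str) else value


# Apply the helper to each value, column by column
lowered = sample.apply(lambda column: column.map(lower_if_text))

print(lowered)
print(lowered.dtypes)
```

Here the `price` column stays numeric and missing values stay missing, so `isnull()` and arithmetic keep working afterwards.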


In this article, we saw how to load the dataset, how to identify NA (missing) values and duplicate values, and how to remove them.

Next article: Data Analysis For Machine Learning with Python (part-3)