21/10/19 20 minutes read 491 Naren Allam
In this article, you will learn in practice how to prepare data for analysis, the basics of data analysis with Python packages, how to create meaningful data visualizations, how to predict future trends from data, and much more.
For the theory behind data analysis, see the Python Guru article:
Data Analysis For Machine Learning with Python (part-1)
4. Exploratory Data Analysis
In this article, the data requirement for analysis is a product requirement. The product is mobile phones, so some domain knowledge of mobiles is needed to understand their features.
How does domain knowledge influence data analysis?
You may have studied data science and machine learning and used algorithms such as regression and classification to make predictions on test data. But the true power of an algorithm and its data can be harnessed only when we have some domain knowledge. The accuracy of the model also tends to increase when such knowledge of the data is used.
For example, knowledge of the mobile industry can be put to use when working with the relevant data. Let's say we have 15 features in this data set.
Where is domain knowledge useful?
Domain knowledge is most useful in feature engineering. Feature engineering means creating features using domain knowledge so that machine learning algorithms perform better.
We see that feature engineering guided by domain knowledge gives a better accuracy score and a lower RMSE.
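As a minimal sketch of what a domain-driven feature might look like, the example below derives a price-per-GB-of-RAM column. The column names (`ram_gb`, `price`) and the sample values are hypothetical and are not taken from the article's data set.

```python
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
phones = pd.DataFrame({
    "model": ["a", "b", "c"],
    "ram_gb": [4, 8, 6],
    "price": [10000, 24000, 15000],
})

# Domain knowledge: buyers often judge a phone by how much RAM they get
# for the price, so the ratio may carry more signal than either column alone.
phones["price_per_gb_ram"] = phones["price"] / phones["ram_gb"]

print(phones)
```

Whether such a feature actually helps depends on the model and the data, so it is worth comparing scores with and without it.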
The data was collected by web scraping, which yields unstructured data. An e-commerce site (Flipkart) was scraped for mobile specifications, and the collected mobiles data was stored in a CSV file for data processing.
Data processing is the method of formatting information so that it is easy to access and use for analysis. It is the process of organizing and controlling a large amount of data. It helps to handle the important data in a consistent manner and to analyze each feature for Exploratory Data Analysis (EDA), which covers data profiling, data cleaning or pre-processing, and data visualization.
4. Exploratory Data Analysis
Let's start a practical example, following the steps discussed above.
Output of the mobiles data (see the image below).
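Loading the scraped CSV file with pandas could look roughly like the sketch below. The in-memory text stands in for the real file, and the column names and rows are made up for illustration; with the actual data you would pass the file path (for example a hypothetical `mobiles.csv`) to `pd.read_csv`.

```python
import io
import pandas as pd

# Stand-in for the scraped CSV file. In practice, pass the path of the
# real file to pd.read_csv instead of this in-memory text.
csv_text = io.StringIO(
    "Brand,Model,Colors,Display\n"
    "BrandA,X1,Black,6.1 inch\n"
    "BrandB,Y2,Blue,6.5 inch\n"
    "BrandA,X1,Black,6.1 inch\n"
)

mobiles = pd.read_csv(csv_text)

print(mobiles.head())   # first rows of the table
print(mobiles.shape)    # (number of rows, number of columns)
```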
You will see how to identify and remove duplicate values from the dataset.
Before dropping duplicate values, we need to find out which values are exact duplicates.
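One way to find exact duplicate rows is `DataFrame.duplicated()`. The sketch below uses a small made-up table rather than the article's data set.

```python
import pandas as pd

# Illustrative rows; the first and third are identical.
mobiles = pd.DataFrame({
    "brand": ["BrandA", "BrandB", "BrandA", "BrandC"],
    "model": ["X1", "Y2", "X1", "Z3"],
    "colors": ["Black", "Blue", "Black", "Red"],
})

# duplicated() marks a row True when an identical row appeared earlier.
dup_mask = mobiles.duplicated()
num_duplicates = int(dup_mask.sum())

# keep=False marks every copy, which makes the duplicated groups easy to inspect.
all_copies = mobiles[mobiles.duplicated(keep=False)]

print(num_duplicates)
print(all_copies)
```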
This mobile data set has duplicate values, and we drop them using pandas. With pandas_profiling we can also easily identify which columns have NA or missing values and duplicate values.
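pandas_profiling produces a full report, but the same basic counts can also be obtained with pandas alone. The sketch below is that lighter alternative, not the profiling report itself, and it runs on made-up rows.

```python
import numpy as np
import pandas as pd

# Illustrative rows with some missing entries.
mobiles = pd.DataFrame({
    "model": ["X1", "Y2", "Z3"],
    "colors": ["Black", None, "Red"],
    "display": ["6.1 inch", np.nan, np.nan],
})

# Number of missing (NA) values in each column.
missing_per_column = mobiles.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(mobiles.duplicated().sum())

print(missing_per_column)
print(duplicate_rows)
```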
This data set has 290 duplicate values; the next step is to remove them.
After dropping the duplicate values, we have 4060 records with 15 features (columns).
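Removing the duplicates can be sketched with `drop_duplicates()`. The table below is a small made-up example, so its row counts differ from the article's figures.

```python
import pandas as pd

# Illustrative rows: 5 records, 2 of which repeat earlier ones.
mobiles = pd.DataFrame({
    "brand": ["BrandA", "BrandB", "BrandA", "BrandC", "BrandB"],
    "model": ["X1", "Y2", "X1", "Z3", "Y2"],
})

# Keep the first occurrence of each row and renumber the index.
deduped = mobiles.drop_duplicates().reset_index(drop=True)

print(mobiles.shape)   # shape before dropping
print(deduped.shape)   # shape after dropping
```

Comparing the shape before and after is a quick check that the expected number of rows was removed.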
The colors column (2) and the Display column (78) have duplicate values (shown in the image below). If any column has a lot of missing values, we can drop that column.
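Dropping columns with many missing values might look like the sketch below. The 50% cutoff (`max_missing_share`) is an assumption chosen for illustration, not a value from the article, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Illustrative rows; "display" is mostly missing.
mobiles = pd.DataFrame({
    "model": ["X1", "Y2", "Z3", "W4"],
    "colors": ["Black", None, "Red", "Blue"],
    "display": [np.nan, np.nan, np.nan, "6.5 inch"],
})

# Assumed cutoff: drop a column when more than half of its values are missing.
max_missing_share = 0.5

# Share of missing values per column (between 0 and 1).
missing_share = mobiles.isna().mean()
cols_to_drop = missing_share[missing_share > max_missing_share].index.tolist()

cleaned = mobiles.drop(columns=cols_to_drop)

print(cols_to_drop)
print(cleaned.columns.tolist())
```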
Convert the column names to lower case, and convert the data to lower case.
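A minimal sketch of both conversions is shown below, again on made-up rows with hypothetical column names. Only text columns are lower-cased; numeric columns are left untouched.

```python
import pandas as pd

# Illustrative rows with mixed-case names and values.
mobiles = pd.DataFrame({
    "Brand": ["BrandA", "BRANDB"],
    "Model": ["X1", "Y2"],
    "RAM_GB": [4, 8],
})

# Lower-case the column names.
mobiles.columns = mobiles.columns.str.lower()

# Lower-case the values of every text column.
for col in mobiles.columns:
    if pd.api.types.is_string_dtype(mobiles[col]):
        mobiles[col] = mobiles[col].str.lower()

print(mobiles)
```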
In this article, we covered how to load the dataset, how to identify NA, missing, and duplicate values, and how to remove them.
Next article: Data Analysis For Machine Learning with Python (part-3)