Problem Statement:

We have the data of purchases of an electronic store. The data roughly contains 186848 rows broken down by month, product type, cost, purchase address, etc. Now we have to dive deep into this sales data and answer five high level business questions:

1- What was the best month for sales? How much was earned that month?

2- Which city sold the most products?

3- What time should we display advertisements to maximize the likelihood of customer’s buying the product?

4- What products are most often sold together?

5- What product sold the most? Why do you think it sold the most?

Data Science tools we will use to answer above questions:

The main tools we will use are Python Pandas and python Matplotlib. We will also go through the different methods of the mentioned tools:

1- Adding different columns to our data frame.

2- Parsing cells as strings to make new columns.

3- We will use the .apply() method.

4- Using groupby to perform aggregate analysis.

5- We will use matplotlib to plot bar charts and line graphs to visualize our results.

6- Labeling graphs.

Dataset:

You can access and download the dataset here. The data set you will find is in different CSV files month wise. For example, the data for the month of April is in one file and for may is in another, and so on. So first we have to merge all the CSV files into one file so that we can analyze all the data at once. Below is a short tutorial on merging different CSV files into one CSV file.

Merging Different CSV files into one CSV file Using Command Line:

Once you download the data set with multiple CSV files to your computer the second thing is to merge those different CSVs into one CSV file. To do so follow the below steps:

1- Make a new folder and copy all the CSV files into that folder. In the below screenshot, you can see I have made a folder called Sales_Data and I have copied all the CSV files into this folder.

2- In the second step you have to open your terminal and navigate to the folder where you have kept all your CSV files. In my case, I will move into the folder Sales_Data. In the below screenshot, you can see I have navigated to the Sales_Data.

3- To combine all the different CSV files data into one data set you have to write the following command. copy *.csv merged-CSV-files.csv.

merging all the CSV files into one CSV file called dataset.csv

4- By running the above command all the files have been copied into one CSV file and you can now use it to perform different operations.

a new CSV file has been created called dataset.csv where all the data has been copied

Now we have the data we can start to work on it.

import pandas as pd # Importing pandas library to read the data
data = pd.read_csv("/content/drive/MyDrive/sales_data.csv") # Reading the data set into the pandas dataframe, by giving path to the dataset
data.head() # This line of code will output the first five rows of the dataset
# Augment the data with some additional columns for this we have ato add a seprate month column to our existing dataframe
data = data.dropna() # dropping all the NAN values fro the data set
data['Month'] = data['Order Date'].str[0:1] # adding an extra column of month, so we can easily use it for further analysis
data['Month'] = data['Month'].astype('int32') # Converting the value of the month column form string to int
data.head()

In the above screenshot now we can see there the ‘NaN’ has been dropped form the dataset and a new column called month has been added which is int type.

Leave a Reply

Your email address will not be published. Required fields are marked *