Getting Started with Python in Power BI

Home / Blog / Getting Started with Python in Power BI
powerbi_python

Python scripts can be run directly in Power BI Desktop. Besides Power Query, we can use Python to load and transform our data. It can also be used during the report to create Python visuals. We will use a dataset from StackExchange to demonstrate how python can be used in these different steps.

The dataset

We will focus our analysis on the Data Science Stack Exchange website. Stack Exchange is in a nutshell a collection of websites on multiple fields and subjects. Each website is a collection of posts ( questions and answers ). The posts can be upvoted and downvoted, and they are associated to tags regarding the subject. Stack Exchange provides a public database for each of its websites. Here's a link to query and explore Data Science database. or example, we can query the most recent posts in the database.

SELECT TOP 5 Id, Title, ViewCount, Body, Tags
  FROM Posts
 ORDER BY CreationDate DESC;

After clicking Run query button, we get the results of our query

run_query_result

In this example, we will work on the subset of the database from the year 2019, already available in a csv format. The document can be downloaded here, to follow along.

Load and transform the data

After opening Power BI Desktop, click on Get data in the Home tab, choose the csv file, and load the data. The subset contains 7 columns.

load_dataset_csv_2

Let's focus on the Tags column. We see that each row contains all the tags of the post, enclosed in <>. We need to apply some data cleaning ( remove <>) to have the data in an interesting format. Also the FavoriteCount column has some missing values, thta need to be addressed.

To use python with Power BI, we first need to have python installed, if not. Go to the official website and get the latest version. Follow the instructions to install python, and keep the folder path where it is installed. After, in Power BI, navigate to File -> Options and settings -> options, and click on Python scripting. Set the path where python has been installed if it is not auto detected. As an option, you can choose to write your script in your preferred IDE

python_path

Now on Home, click on Transform Data. When the Power Query editor opens, click on Transform tab, and Run python script .

python_script

Python uses the pandas library, which is a very popular open source tool for data analysis. The table will be loaded in an object called a dataframe. The variable name is dataset, as indicated in the comment ( line starting with #)

Data cleaning

Let the fun begin! To deal with missing values, we can use the fillna() method on the dataframe. For the column FavoriteCount, we can fill missing values with 0:

dataset["FavoriteCount"] = dataset["FavoriteCount"].fillna(0)

To clean the Tags column we can replace the characters we want to remove with "," or simply remove them:

dataset["Tags"] = dataset["Tags"].str.replace("><",",").str.replace("[<>]","",regex=True)

The following script is then created, and we click Ok to execute it.

data_cleaning

You will see a warning displayed about data privacy. For now, we check the ignore Privacy levels checkbox and click "save" button. If there is no error in the script , we should get the dataset in a table like this

dataset_table_to_load

We have to click on table field to load the result of the script in our original table.

data_clean_result

Add a new column

Imagine we want to evaluate in avarage how many tags a post received. We can create a column that holds the number of tags for each post. It is a simple task to achieve with Pandas, as we will apply a function to a column to create a new column

# 'dataset' holds the input data for this script
dataset["NbTags"] = dataset["Tags"].apply(lambda x: len(x.split(",")))

The cleaning we did before allows us to retrieve easily the number of tags per post. After running the script, and loading the dataset, we obtain the table with the new column created.

table_new_column

We can finally apply our changes, and close the query editor.

Now let try to answer some questions we may have related to the tags:

  • What is the average number of tags per post?
  • How many times each tag was used?
  • How many times each tag was viewed?

Python visuals in Power BI

The average number of tags per post will be easily displayed with a card. However, we need to do some calculation to answer the last 2 questions. We will use the Python visual. Navigate to the report page and click on the Py symbol. Click the Enable button on the pop up. A python script editor is opened on the bottom. Choose the tags and ViewCount columns. A dataset is created with the 2 columns chosen, and we can create the visuals with the python library.

We need to install pandas and matplotlib library for the visualization. Open a terminal, navigate to python installation folder, and install the package with the following lines:

pip install pandas
pip install matplotlib

Let add the following code to create a new dataframe that contains the counts and the views for each tag, and also plot the top 10 tags with the maximum of counts.

import matplotlib.pyplot as plt
dictTag = {}
viewTag = {}
resultTag = {}
def getTagValue(row):
    #print(row)
    listTag = row["Tags"].split(',')
    for elt in listTag:
        if elt in dictTag.keys():
            dictTag[elt] += 1
            viewTag[elt] += row["ViewCount"]
        else:
            dictTag[elt] = 1
            viewTag[elt] = row["ViewCount"]
    for key in dictTag.keys():
        resultTag[key] = [dictTag[key], viewTag[key]]
            
dataset.apply(getTagValue, axis=1)
pdTags = pandas.DataFrame.from_dict(resultTag, orient='index', columns=['count','view'])
head10 = pdTags.sort_values(by="count",axis=0, ascending=False).head(10)
plt.figure()
head10.plot.barh(y='count')
plt.show()

It is important to note that a script visual editor shall always end by showing a figure. We can duplicate the visual and plot also the top 10 tags regarding the views

result visual editor

We thus obtain the information we were looking for.

This is just a simple example to demonstrate the use of python in Power BI. On the visualization side, you should consider a library like seaborn to really enhance the visuals natively present in Power BI.

Categories: power bi python

Share

No Comment

Leave a reply

*Required for comment verification