Aspiring data analysts often create their own projects to improve their capabilities and gain hands-on experience working with different programming languages. Working on data analysis projects in your free time can also provide you with plenty of material to enhance your portfolio, resume and application materials. If you're preparing to apply for roles as a data analyst, then you might be interested in learning about the types of projects you can develop for your resume to impress hiring managers.
In this article, we explain what qualifies as a data analyst project, list 14 types of data analyst projects you can choose from and explain how to include these projects on your resume.
Doing data analysis projects is critical to landing a job, as they show hiring managers that you have the skills for the role. Professionals in this field must master a range of skills, from data cleaning and data visualization to programming languages like SQL, R, and Python. A data analysis project can demonstrate your aptitude with all of these skills. Furthermore, personal projects are a great way to practice a variety of data analysis techniques, especially if you lack real-world experience.
Projects are an excellent way to gain experience with the end-to-end data analysis process, especially if you’re new to the field of data analysis. Here are some great project ideas for beginners:
Reddit is a popular source for web scraping because of the sheer amount of data available, from qualitative data in posts and comments to user metadata and engagement with each post.
Subreddits on Reddit enable you to extract posts on specific topics. PRAW is a Python package you can use to access Reddit’s API to scrape the subreddits you’re interested in (a Reddit account is required to get an API key). You can then extract data from one or more subreddits at a time. If you’d rather not scrape your own data, you can find Reddit datasets on data.world.
If you’re interested in real estate, you can use Python to scrape data on real-estate properties, then create a dashboard to analyze the “best” properties based on data points like property taxes, population, schools, and public transportation. There are two main Python libraries for data scraping: Scrapy and BeautifulSoup. You can also use the Zillow API to obtain real estate and mortgage data.
Another great project for beginners is to do an exploratory data analysis (EDA), which is the probing of a dataset to summarize its main characteristics. EDA helps determine which statistical techniques are appropriate for a given dataset. Here are some projects where you can work on your EDA chops:
McDonald’s food items are often controversial because of their high fat and sodium content. Using this dataset from Kaggle, you can perform a nutrition analysis of every menu item, including salads, beverages, and desserts. First, import the CSV file in Python. Next, categorize items according to factors like sugar and fiber content. Finally, visualize the results using bar and pie charts, scatter plots, and heatmaps. For this project, you’ll need the NumPy, pandas, and Seaborn libraries.
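The import-and-categorize steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library so it runs without installs; in a real project you would likely use pandas instead. The column names, the sugar thresholds, and every row in the sample CSV are placeholders made up for this example, not values from the Kaggle dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows standing in for the real CSV file.
# Item names, column names, and values are invented for illustration.
SAMPLE_CSV = """Item,Category,Sugars,Dietary Fiber
Item A,Breakfast,3,4
Item B,Breakfast,17,2
Item C,Beverages,40,0
Item D,Beverages,0,0
Item E,Desserts,48,1
"""


def sugar_band(grams):
    """Bucket an item by sugar content (thresholds are arbitrary examples)."""
    if grams < 5:
        return "low"
    if grams < 25:
        return "medium"
    return "high"


# Step 1: import the CSV (here from a string; use open(path) for a real file).
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Step 2: group sugar values by menu category and summarize each group.
sugar_by_category = defaultdict(list)
for row in rows:
    sugar_by_category[row["Category"]].append(float(row["Sugars"]))
avg_sugar = {cat: mean(vals) for cat, vals in sugar_by_category.items()}

# Step 3: label every item with a sugar band, ready for charting.
bands = {row["Item"]: sugar_band(float(row["Sugars"])) for row in rows}

print(avg_sugar)
print(bands)
```

The resulting dictionaries map directly onto a bar chart (average sugar per category) or a pie chart (share of items per band).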
The World Happiness Report surveys happiness levels around the globe. This project, from a student at Pennsylvania State University, uses SQLite, a popular database engine, to analyze the difference in happiness levels between the Northern and Southern Hemispheres.
Visualizations communicate trends, outliers, and patterns in your data. So if you’re new to the field, and looking for a data analysis project, then creating visualizations is a great place to start. Select graphs that are ideal for the story you’re trying to tell. Bar charts and line charts succinctly illustrate changes over time, while pie charts model part-to-whole comparisons. Meanwhile, bar charts and histograms show the distribution of data. Here are some great data visualization projects for beginners:
Data visualizations are a great way to illustrate historical events, such as the spread of the printing press or trends in coffee production and consumption. This visualization by Harvard Business School depicts the largest US companies in the year 1955. A second analysis in 2015 shows how much has changed. There is also an abundance of datasets available on World War II. This Kaggle dataset features data on weather conditions during the war, which had a major influence on the success of an invasion.
Modern telescopes and satellites produce digital images that are perfect for data visualization. This dataset from data.world shows asteroids poised to pass near Earth within the next 12 months, as well as those that have made a close approach within the last 12 months. You can view live visualizations based on the dataset here to inspire your own analysis. You can also use this resource to find the asteroid orbital classes for each data point (e.g., asteroid, apollo, centaur).
This project on KDNuggets makes use of Jupyter notebooks and IPython to analyze Instagram data. Regular Python works fine, but you may not be able to display the images in your notebook. You can use Instagram data to compare the popularity of two political candidates, like this project, or perform a time series analysis on a public figure’s popularity before and after a major event.
Sentiment analysis (AKA “opinion mining”) entails using natural language processing (NLP) to determine how people feel about a product, public figure, or political party, for example. Each input is assigned a sentiment score, which classifies it as positive, negative, or neutral. You’ll definitely want to hone this skill to land a job in data analysis. Here are some great projects to add to your portfolio:
Google reviews are a great resource for customer feedback, and also make for a great data analysis project. The Google My Business API lets you extract reviews and work with location data. In this project on Medium, data enthusiast Nikita Bhole used Python to perform a sentiment analysis on user reviews from the Google Play Store. She then used Pandas profiling to perform an exploratory data analysis to find variables, interactions, correlations, and missing values. Next, she used TextBlob to calculate a sentiment score based on sentiment polarity and subjectivity.
Quora is one of the most popular question-and-answer websites in the world, making it ripe for data analysis. In a recent Kaggle challenge, users were tasked with using advanced NLP to classify duplicate question pairs. For example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora. This dataset from Quora contains over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair. In this project conducted by a group of NYU students, n-grams (contiguous sequences of n words) were used with a basic linear model to build a set of features for a natural language understanding (NLU) model. Then they used scikit-learn’s Support Vector Machine (SVM) implementation for their experiments with word embedding.
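To make the idea of n-gram features concrete, here is a small sketch that scores a question pair by the overlap of their word n-grams (Jaccard similarity). It is not the NYU students' method or the Kaggle challenge's reference solution; the function names are hypothetical, and a score like this would be just one input feature to a classifier.

```python
import re


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, punctuation-free text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(q1, q2, n=1):
    """Jaccard similarity between the n-gram sets of two questions."""
    a, b = word_ngrams(q1, n), word_ngrams(q2, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


q1 = "What is the most populous state in the USA?"
q2 = "Which state in the United States has the most people?"

# Unigram and bigram overlap for the example pair from the text.
print(ngram_overlap(q1, q2, n=1))
print(ngram_overlap(q1, q2, n=2))
```

The example pair shares only a few words even though the questions mean the same thing, which is why the challenge calls for more than surface overlap, such as word embeddings.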
Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete data within a dataset. Messy data leads to unreliable outcomes. Cleaning data is an essential part of data analysis, and demonstrating your data cleaning skills is key to landing a job. Here are some projects to test out your data cleaning skills:
Airbnb’s open API lets you extract data on Airbnb stays from the company’s website. Alternatively, you can use this existing Kaggle dataset for Airbnb stays in New York City in 2019. Both data files include all the information needed to find out more about hosts and geographical availability, both of which are necessary metrics to make predictions and draw conclusions.
The top trending videos on YouTube provide a window into the current cultural zeitgeist. This dataset from Kaggle contains several months of data on daily trending YouTube videos from different countries. This includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count. Once cleaned, you could use this data for further analysis.
If you’re at the intermediate level and want to advance your data analysis career, you’ll want to improve your skills in data mining, data science, data collection, data cleaning, and data visualization. Here are some great projects to add to your portfolio:
Data mining is the process of turning raw data into useful information. Here are some data mining projects that you can do to advance your career as a data analyst:
Speech recognition programs identify spoken words and convert them into text. To do this in Python, install a speech recognition package such as Apiai, SpeechRecognition, or Watson-developer-cloud. This project, which is called DeepSpeech, is an open-source speech-to-text engine using Google’s TensorFlow.
While streaming recommendation engines are useful, why not build a recommendation engine for a niche genre? This crowd-sourced dataset from Kaggle contains information on user preference data from 73,516 users on 12,294 anime shows. You can categorize similar shows based on reviews, characters, and synopses to build different recommendation algorithms.
A chatbot uses natural language processing to understand text inputs (chat messages) and generate responses. You can build a chatbot using the Natural Language Toolkit (NLTK) library in Python. ChatterBot is an open-source machine learning dialog engine on GitHub that lets anyone contribute dialog. Each time a user enters a statement, the library saves the text they entered. As ChatterBot receives more input, it learns to provide more varied responses with increasing accuracy.
Data collection is the process of gathering, measuring, and analyzing data from a variety of sources to answer questions, solve business problems, and investigate hypotheses. An effective data analysis project shows proficiency in all stages of the data analysis process, from identifying data sources to visualizing data. Here’s a project to advance your data collection, cleaning, and visualization skills:
The Apple Watch collects different types of workout data, including total calories burned, distance (for walking and running), average heart rate, and average pace. Using processed data, you can create visualizations such as rolling mean step count or step counts by days of the week, as seen in this project by full-stack engineer Mark Koester.
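As a rough sketch of the two summaries mentioned above, the following computes a trailing rolling mean and an average step count per weekday using only the standard library. The step counts are made-up placeholder values, not data from the Apple Watch or from Mark Koester's project, and the function name is hypothetical.

```python
from collections import defaultdict, deque
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def rolling_mean(values, window):
    """Trailing rolling mean; positions without a full window yield None."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(mean(buf) if len(buf) == window else None)
    return out


# Placeholder daily step counts (invented), starting on Monday 2024-01-01.
start = date(2024, 1, 1)
steps = [4000, 6000, 8000, 10000, 12000, 3000, 5000, 7000]
days = [start + timedelta(days=i) for i in range(len(steps))]

# 3-day rolling mean smooths out day-to-day noise.
smoothed = rolling_mean(steps, window=3)

# Average steps per weekday (date.weekday() returns 0 for Monday).
by_weekday = defaultdict(list)
for d, s in zip(days, steps):
    by_weekday[WEEKDAYS[d.weekday()]].append(s)
weekday_avg = {name: mean(vals) for name, vals in by_weekday.items()}

print(smoothed)
print(weekday_avg)
```

Either result can be passed straight to a plotting library: the smoothed series as a line chart, the weekday averages as a bar chart.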
Machine learning enables computers to continuously make predictions based on the available data without being explicitly programmed to do so. These algorithms use historical data as input to predict new output values. Here are some common machine learning projects you can try out:
Machine learning uses models for fraud detection that continuously learn to detect new threats. This project for credit card fraud detection uses Amazon SageMaker to train supervised and unsupervised machine learning models, which are then deployed using Amazon SageMaker-managed endpoints.
Recommendation engines use data from user preferences and browsing history to suggest items a user is likely to enjoy. To build a movie recommender, you can use this dataset from MovieLens, which contains 105,339 ratings applied to over 10,000 movies. Follow each step in more detail here.
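One simple way to turn ratings into recommendations is user-based collaborative filtering: find users whose ratings resemble yours, then rank the movies they rated that you have not seen. The sketch below assumes that approach, which is not necessarily the one used in the linked tutorial. The ratings are invented placeholders rather than MovieLens data, and the function names are hypothetical.

```python
from math import sqrt

# Placeholder ratings (user -> {movie: rating}); invented for illustration.
ratings = {
    "u1": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "u2": {"Movie A": 4, "Movie B": 5},
    "u3": {"Movie A": 1, "Movie C": 5, "Movie D": 4},
    "u4": {"Movie C": 4, "Movie D": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=2):
    """Rank unseen movies by the similarity-weighted average of others' ratings."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
                weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(scores, key=lambda m: scores[m] / weights[m], reverse=True)
    return ranked[:top_n]


print(recommend("u2", ratings))
```

With so few ratings, a single weakly similar user can dominate a movie's score. Real datasets reduce this by requiring a minimum number of shared ratings before trusting a similarity value.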
Natural language processing (NLP) is a branch of AI that helps computers interpret and manipulate natural language in the form of text and audio. Try adding some of these NLP projects to your portfolio to land a more senior-level position:
You can build a web application that translates news from one language to another using Python. In this project, data scientist Abubakar Abid used Newspaper3k, a Python library that lets you scrape almost any news site. Then, he used Hugging Face Transformers, a library of state-of-the-art natural language models, to translate and summarize news articles from English to Arabic (you can choose another target language if desired). Finally, Abid used the Gradio library to build a web-based demo where he tried out the algorithm on different topics.
Deep learning is concerned with neural networks comprising three or more layers. These artificial neural networks are inspired by the structure and function of the human brain. Practice your deep learning skills with these projects:
Breast cancer classification is a binary classification problem that works by categorizing biopsy photographs as benign or malignant. This project uses a convolutional neural network (CNN) to identify high-level features in the input images and implement matrix computations to infer a feature map.
Image classification models can be trained to recognize specific objects or features. You can build one using a CNN in Keras with Python. This project uses the CIFAR-10 dataset, a popular computer vision dataset consisting of 60,000 images with 10 different classes. The dataset is already available in the datasets module of Keras, so you can directly import it from keras.datasets.
Regardless of level or skill set, data analysts can always improve on the following skills:
SQL is mainly used for storing and retrieving data from databases, writing queries, and modifying the schema (structure) of a database system. In your data analysis project, be sure to make use of some of the most important SQL commands, such as SELECT, DELETE, CREATE DATABASE, INSERT INTO, ALTER DATABASE, CREATE TABLE, and CREATE INDEX.
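Several of these commands can be tried without installing a database server by using Python's built-in sqlite3 module. The sketch below is illustrative only: the table, column names, and rows are hypothetical. SQLite has no CREATE DATABASE or ALTER DATABASE statement (a database is simply a file, or memory), so those two commands are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE TABLE: define the schema.
cur.execute("CREATE TABLE scores (country TEXT, hemisphere TEXT, score REAL)")

# INSERT INTO: load placeholder rows (invented for illustration).
cur.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("Country A", "north", 7.0),
        ("Country B", "north", 6.0),
        ("Country C", "south", 5.0),
        ("Country D", "south", None),  # incomplete row
    ],
)

# CREATE INDEX: speed up lookups on a frequently filtered column.
cur.execute("CREATE INDEX idx_hemisphere ON scores (hemisphere)")

# DELETE: remove rows with a missing score.
cur.execute("DELETE FROM scores WHERE score IS NULL")

# SELECT: aggregate the remaining rows by group.
cur.execute(
    "SELECT hemisphere, AVG(score) FROM scores "
    "GROUP BY hemisphere ORDER BY hemisphere"
)
averages = dict(cur.fetchall())
conn.close()

print(averages)
```

Passing values through `?` placeholders, rather than formatting them into the SQL string, keeps the queries safe from injection when the data comes from outside your script.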
While data analysts don’t need to have advanced coding skills, the ability to program in R or Python lets you use more advanced data science techniques such as machine learning and natural language processing.
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incomplete, duplicated, incorrect, or improperly formatted. Fixing spelling and syntax errors, standardizing naming conventions, and correcting mistakes are key skills.
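A minimal sketch of those cleaning steps, assuming records arrive as dictionaries with hypothetical "name" and "city" fields: it standardizes whitespace and capitalization, drops incomplete rows, and removes duplicates. The sample records are invented for illustration.

```python
def normalize(value):
    """Collapse extra whitespace and standardize capitalization."""
    return " ".join((value or "").split()).title()


def clean_records(records):
    """Normalize fields, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = normalize(rec.get("name"))
        city = normalize(rec.get("city"))
        if not name or not city:
            continue  # incomplete row: a required field is missing
        key = (name, city)
        if key in seen:
            continue  # duplicate once formatting differences are removed
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


raw = [
    {"name": "  alice  smith", "city": "new york"},
    {"name": "Alice Smith", "city": "New York "},  # same person, messy format
    {"name": "bob jones", "city": ""},             # missing city
    {"name": "Carol Lee", "city": "boston"},
]

print(clean_records(raw))
```

Normalizing before checking for duplicates matters here: the first two records only match once whitespace and capitalization are standardized.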
As a data analyst, it’s important to communicate your findings with strong visuals that appeal to both technical and non-technical stakeholders. To visualize your data effectively, you need to know the specific use cases for each type of visual, from bar charts to histograms and more.
Data analysts use Excel and other spreadsheet tools to sort, filter, and clean their data. Excel is also a useful tool for doing simple calculations (e.g., SUMIF and AVERAGEIF) or combining data using VLOOKUP.
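If you are moving from spreadsheets to code, it can help to see what these functions do under the hood. The sketch below gives rough Python equivalents; the function names mirror the Excel ones but are hypothetical helpers, they cover only exact-match criteria, and the sample values are invented.

```python
def sumif(criteria_range, criterion, sum_range):
    """Sum the values whose matching criteria cell equals the criterion."""
    return sum(v for c, v in zip(criteria_range, sum_range) if c == criterion)


def averageif(criteria_range, criterion, average_range):
    """Average the values whose matching criteria cell equals the criterion."""
    matches = [v for c, v in zip(criteria_range, average_range) if c == criterion]
    if not matches:
        raise ValueError("no cells match the criterion")
    return sum(matches) / len(matches)


def vlookup(key, table, col_index):
    """Exact-match lookup on the first column; col_index is 1-based as in Excel."""
    for row in table:
        if row[0] == key:
            return row[col_index - 1]
    raise KeyError(key)


# Placeholder data (invented for illustration).
regions = ["east", "west", "east", "west"]
sales = [100, 200, 300, 400]
managers = [("east", "Manager E"), ("west", "Manager W")]

print(sumif(regions, "east", sales))
print(averageif(regions, "west", sales))
print(vlookup("west", managers, 2))
```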
Listing the data analyst projects you've worked on can help you develop a unique resume that sets you apart from other candidates who may have similar work experiences and academic backgrounds. Here are some steps to help you include data analyst projects on your resume:
1. Review the job description. Identify what skills the hiring manager is looking for and then select relevant projects you've worked on that demonstrate your capabilities in these areas.
2. Determine where to list your projects. If you have a significant number of projects you'd like to highlight, consider creating a separate projects section on your resume. Otherwise, consider including projects under the work experience or education sections.
3. Include a link to your online portfolio. You can include this link along with your contact information to encourage hiring managers to explore the projects you've worked on in the past.
We at Alphaa AI are on a mission to tell #1billion #datastories, each with its own unique perspective. We are the community that is creating Citizen Data Scientists, who bring a data-first approach to their work, core specialisation, and organisation. With Saurabh Moody and Preksha Kaparwan, you can start your journey as a citizen data scientist.