Data Cleaning Using SQL
Project Overview: In this project, I performed comprehensive data cleaning and transformation on the Nashville Housing dataset using SQL. The goal was to enhance the dataset's usability for analysis by addressing various data quality issues.
Skills and Concepts
- Standardizing Date Formats: Converted inconsistent date formats to a standardized date format using the CONVERT function. Added a new column
SaleDateConverted
and populated it with converted date values. - Populating Missing Property Address Data: Identified and populated missing property addresses by joining the dataset on
ParcelID
and using theISNULL
function to fill in gaps. - Breaking Out Address into Individual Columns: Split the
PropertyAddress
field into separate columns for address, city, and state using string functions likeSUBSTRING
andCHARINDEX
. Added and updated columnsPropertySplitAddress
,PropertySplitCity
, andPropertySplitState
. - Handling Owner Address Data: Applied similar transformations to the
OwnerAddress
field, breaking it intoOwnerSplitAddress
,OwnerSplitCity
, andOwnerSplitState
. - Updating Boolean Fields: Standardized boolean fields, converting 'Y' and 'N' values to 'Yes' and 'No' respectively, using the
CASE
statement. - Removing Duplicates: Implemented a Common Table Expression (CTE) to identify and remove duplicate records based on key fields like
ParcelID
,PropertyAddress
,SalePrice
,SaleDate
, andLegalReference
. - Deleting Unused Columns: Dropped unnecessary columns such as
OwnerAddress
,TaxDistrict
,PropertyAddress
, andSaleDate
to streamline the dataset.
Conclusion
This project demonstrates essential data cleaning techniques and SQL proficiency, valuable for data analysis and business intelligence. Key tasks included standardizing date formats, populating missing property addresses, splitting address fields into individual columns, handling owner address data, standardizing boolean fields, removing duplicates, and deleting unused columns.
GitHub Project
You can view the full project on GitHub here.