In this new week, I am learning all about the Process phase of data analytics.
‼️ Recall the 6 phases of data analytics
- ASK PHASE
- PREPARE PHASE
⭕ PROCESS PHASE
- ANALYZE PHASE
- SHARE PHASE
- ACT PHASE
In today's lesson, I learnt all about DATA INTEGRITY.
It can be defined as the ACCURACY, COMPLETENESS, CONSISTENCY AND TRUSTWORTHINESS of data throughout its life cycle.
One missing piece can make all of your data useless. Data can be compromised in many ways. It can be compromised during:
➡️ DATA REPLICATION, which is the process of storing data in multiple locations.
➡️ DATA TRANSFER, which is the process of copying data from a storage device to memory or from one computer to another. If the transfer is interrupted, you may end up with an incomplete dataset.
➡️ DATA MANIPULATION, which is the process of changing data to make it more organized and easier to read.
Data can also be compromised through
➡️ Human error
➡️ Viruses
➡️ Malware
➡️ Hacking
➡️ System failures
It is important to check that the data you use aligns with the business objectives.
When you are getting ready for data analysis, you might realize you don't have the data you need or you don't have enough of it. In some cases you use what is called PROXY DATA in place of the real data.
There are different types of insufficient data, which include:
💠 Data from only one source
💠 Data that keeps updating
💠 Outdated data
💠 Geographically limited data
There are ways we can address insufficient data problems.
💠 Identify trends with the available data
💠 Wait for more data if time allows
💠 Talk with stakeholders and adjust your objectives
💠 Look for a new dataset
I also learnt about sample size calculation
- POPULATION: The entire group that you are interested in for your study
- SAMPLE: A subset of your population
- MARGIN OF ERROR: Since a sample is used to represent a population, the sample results are expected to differ from what the results would have been if you had surveyed the entire population. This difference is called the MARGIN OF ERROR.
- CONFIDENCE LEVEL: How confident you are in the survey results
- CONFIDENCE INTERVAL: The range of possible values that the
population's result would be at the confidence level of the study.
This range is the sample results +/- the margin of error.
- STATISTICAL SIGNIFICANCE: The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.
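As a quick worked example for myself (the numbers below are my own assumption, not from the lesson): for a survey of n = 1000 people at a 95% confidence level (z = 1.96) and the most conservative proportion p = 0.5, the margin of error works out as follows.

```latex
% Margin of error for a proportion (assumed values: n = 1000, p = 0.5, z = 1.96)
MOE = z \sqrt{\frac{p(1-p)}{n}}
    = 1.96 \sqrt{\frac{0.5 \times 0.5}{1000}}
    \approx 0.031 \quad (\text{about } 3.1\%)
```

So if 50% of the sample answered yes, the confidence interval would be roughly 50% +/- 3.1%, that is 46.9% to 53.1%.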
I also learned about the tasks you should complete before analyzing data, which are:
- Determine data integrity by assessing the overall accuracy, consistency and completeness of the data.
- Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.
- Know when to stop collecting data.
STATISTICAL POWER is the probability of getting meaningful results from a test.
HYPOTHESIS TESTING is a way to see if a survey or experiment has meaningful results.
DATA CONSTRAINTS are the criteria that determine whether a piece of data is clean and valid.
DATEDIF is a spreadsheet function that calculates the number of days, months, or years between two dates.
ESTIMATED RESPONSE RATE is the average number of people who typically complete a survey.
REGULAR EXPRESSION (RegEx) is a rule that says the values in a table must match a prescribed pattern
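To make the RegEx idea concrete for myself, here is a minimal sketch of a data constraint check in SQL. The customers table and phone_number column are made up for illustration, and REGEXP_CONTAINS is BigQuery syntax (other databases have similar functions such as REGEXP_LIKE).

```sql
-- Find rows whose phone_number breaks the expected 123-456-7890 pattern
-- (hypothetical customers table; BigQuery Standard SQL)
SELECT
  customer_id,
  phone_number
FROM customers
WHERE NOT REGEXP_CONTAINS(phone_number, r'^\d{3}-\d{3}-\d{4}$');
```

Any rows this query returns fail the data constraint and need cleaning.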
The most common cause of poor-quality data is HUMAN ERROR.
Clean data is data that is complete, correct and relevant to the problem you're trying to solve.
Dirty data is data that is incomplete, incorrect or irrelevant to the problem you are trying to solve. There are different types of dirty data, which include:
➡️ Duplicate data is any data record that shows up more than once; it can be caused by manual data entry, batch data imports or data migration.
➡️ Outdated data is any data that is old and should be replaced with newer and more accurate information.
➡️ Incomplete data is any data that is missing important fields.
➡️ Incorrect or inaccurate data is any data that is complete but inaccurate; it can be caused by human error during data input, fake information or mock data.
➡️ Inconsistent data is data that uses different formats to represent the same thing; it can be caused by data being stored incorrectly or errors inserted during data transfer.
Other causes of dirty data include:
➡️ Manual data entry
➡️ Errors in batch data imports
➡️ Data migration
➡️ Software obsolescence
➡️ Improper data collection
➡️ Human errors during data input
DATA VALIDATION is a tool for checking the accuracy and quality of data before adding or importing it. There are steps you have to follow to clean data
💠 Removing unwanted data
💠 Getting rid of duplicates or data that is irrelevant to what you want to
solve
💠 Fixing misspellings
💠 Fixing inconsistent capitalization
💠 Fixing punctuation and other typos
DATA MERGING is the process of combining 2 or more datasets into a single dataset.
DATA COMPATIBILITY describes how well 2 or more datasets are able to work together
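Here is a minimal sketch of what merging could look like in SQL, assuming two made-up tables sales_january and sales_february that have identical columns.

```sql
-- Merge two compatible datasets into a single dataset.
-- UNION ALL only works if both tables have the same columns,
-- in the same order, with matching data types.
SELECT order_id, order_date, amount FROM sales_january
UNION ALL
SELECT order_id, order_date, amount FROM sales_february;
```

If one table stored dates as text and the other used a DATE type, the two datasets would not be compatible until that column is cleaned.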
Common errors you should avoid are
💠 Not checking for spelling errors
💠 Forgetting to document errors
💠 Not checking for misfielded values (values entered into the wrong field)
💠 Overlooking missing values
💠 Only looking at a subset of the values
💠 Losing track of business objectives
💠 Not analyzing the system prior to data cleaning
💠 Not fixing the source of the error
💠 Not backing up your data prior to data cleaning
💠 Not accounting for data cleaning in your deadlines/process.
Today I also did three Excel exercises on data cleaning. It was very exciting, and I will continue to practice every day.
I did more practical work, like cleaning data in Microsoft Excel using different datasets.
I used different functions to make sure that my data was clean (removing duplicates, making sure the data format was consistent, removing unwanted spaces, and splitting text and numbers to make reading easier).
I used different Excel functions to achieve my objective of having clean data:
💠 TRIM
💠 VLOOKUP
💠 SPLIT TEXT TO COLUMN
💠 DATA VALIDATION
💠 FILTER
💠 SPLIT
💠 CONDITIONAL FORMATTING
💠 PIVOT TABLES
💠 PASTE SPECIAL
I learnt how to clean data using SQL (sketched below):
➡️ REMOVING DUPLICATES USING "DISTINCT"
➡️ USED "SUBSTRING" TO GET ONLY THE FIRST 2 LETTERS OF A COUNTRY
➡️ USED "LENGTH" TO KNOW HOW MANY LETTERS ARE USED TO REPRESENT A COUNTRY
➡️ USED THE "TRIM" FUNCTION TO REMOVE SPACES
I also did a few exercises to learn how to clean and adjust my data using
➡️ UPDATE
➡️ SET.
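A minimal sketch of how UPDATE and SET can fix values in place, again on the made-up customer_data table (remembering to back up the table before running an UPDATE):

```sql
-- Overwrite inconsistent values with one standard value
UPDATE customer_data
SET country = 'US'
WHERE country IN ('USA', 'United States', 'U.S.');

-- Strip stray spaces in place
UPDATE customer_data
SET country = TRIM(country);
```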
DATA VERIFICATION is a process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable.
CHANGELOG is a file containing a chronological list of modifications made to a project. It is usually organized by version and includes the date followed by a list of added, improved and removed features.
See the big picture when verifying data cleaning
💠 Consider the business problem
💠 Consider the goal
💠 Consider the data
Data cleaning checklist: correct the most common problems
➡️ Sources of error
➡️ Null data (search using conditional formatting and filters)
➡️ Misspelled words
➡️ Mistyped numbers
➡️ Extra spaces and characters (using TRIM functions)
➡️ Duplicates (DISTINCT in SQL or the Remove Duplicates feature in Excel or other spreadsheets)
➡️ Messy (inconsistent) strings
➡️ Messy (inconsistent) data formats
➡️ Misleading variable labels (column)
➡️ Truncated or missing data
➡️ Business logic (check that the data makes sense given your knowledge of the business)
➡️ Review the goal of your project to make sure that your data still aligns with the goal. This is a continuous process that you will do throughout your project.
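As a final check, here is a minimal sketch of a quick verification query on the same made-up customer_data table. COUNT(column) ignores NULLs, so the difference from COUNT(*) gives the number of missing values.

```sql
-- Quick post-cleaning verification: row count, duplicate IDs and missing values
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_ids,
  COUNT(*) - COUNT(country) AS null_countries
FROM customer_data;
```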
DOCUMENTATION
The process of tracking changes, additions, deletions and errors involved in your data cleaning efforts, for example in a CHANGELOG.
⭕ Recover data cleaning errors so you can reuse the fix when you come across a similar problem or when you want to redo the cleaning
⭕ Inform other users of changes you have made (a reference sheet for other analysts in case you resign)
⭕ Determine the quality of the data.
Author of article: Awasume Marylin