A customer asked me the other day about a method to do some data profiling against a file system. So I thought I would share the request, my suggestion and little PowerShell script I crafted to do some data profiling. The request read as follows: “Do you know of any tools that will list all files in a directory, and all subs and provide attributes like filename, path, owner, create date, modify date, last access date and maybe some other attributes in report format?”
My recommendation was a commercial product called TreeSize Professional by JAM Software the product license cost is ~ $50 and worth every penny, scan speeds are good, supports UNC paths and reporting is intuitive. Overall an excellent product.
As an alternative below is a quick PowerShell script (also attached to post as data_profile.ps1) that will create a CSV file with data profiling information, once the CSV file is create the CSV can be opened in Excel (or your spreadsheet tool of choice) or imported into a DB and manipulated.
$root = "c:\\files"
$report = ".\report.csv"
$AllFiles = @()
foreach ($file in get-childitem $root -recurse| Select-Object FullName, Root, Directory, Parent, Name, Extension, PSIsContainer, IsReadOnly, Length, CreationTime, LastAccessTime, LastWriteTime, Attributes)
{
$acl = get-acl $file.fullname | select-object path,owner,accesstostring,group
$obj = new-object psObject
#$obj | Add-Member -membertype noteproperty -name FilePathandName -Value $file.FullName
$obj | Add-Member -membertype noteproperty -name Root -Value $file.Root
$obj | Add-Member -membertype noteproperty -name Ditrectory -Value $file.Directory
$obj | Add-Member -membertype noteproperty -name Parent -Value $file.Parent
$obj | Add-Member -membertype noteproperty -name Name -Value $file.Name
$obj | Add-Member -membertype noteproperty -name Extension -Value $file.Extension
$obj | Add-Member -membertype noteproperty -name IsDIR -Value $file.PSIsContainer
$obj | Add-Member -membertype noteproperty -name IsReadOnly -Value $file.IsReadOnly
$obj | Add-Member -membertype noteproperty -name Size -Value $file.Length
$obj | Add-Member -membertype noteproperty -name CreationTime -Value $file.CreationTime
$obj | Add-Member -MemberType noteproperty -Name LastAccessTime -Value $file.LastAccessTime
$obj | Add-Member -MemberType noteproperty -Name LastWriteTime -Value $file.LastWriteTime
$obj | Add-Member -MemberType noteproperty -Name Attributes -Value $file.Attributes
#$obj | Add-Member -MemberType noteproperty -Name Path -Value $acl.path
$obj | Add-Member -MemberType noteproperty -Name Owner -Value $acl.owner
$obj | Add-Member -MemberType noteproperty -Name AccessToString -Value $acl.accesstostring
$obj | Add-Member -MemberType noteproperty -Name Group -Value $acl.group
$AllFiles += $obj
}
$AllFiles |Export-Csv $report –NoTypeInformation
The above script scans all files recursively starting at c:\files and outputs the results to results.csv. One thing to note is that the scan stores all data in an array in memory, the is because the PowerShell Export-Csv function does not support appending to a CSV file (you gotta wonder what Microsoft talks about in design meetings). I will likely create a version of the script that uses the out-file function to write each row to the csv file as the scan happens rather then storing in memory until the scan is completes and then writing the entire array to the report.csv file, goal here is to reduce the memory footprint during large scans.
The output of this file script will be similar to the following: