1. What is the problem that we are trying to solve? How will it impact the real world / bottom line? Does it have to be this way?
The Environmental Protection Agency (EPA) is attempting to limit the amount of ozone in counties
Limiting ozone would improve air quality and lessen the damage done to citizens' lungs
The current standard is 75 ppb, but the proposed new limits would be between 60 and 70 ppb.
Yes, it has to be this way: an enforced limit is the only way to ensure counties stay in compliance
The new rule would be a major victory for public health groups
It would also further stoke partisan clashes between the president and Republicans poised to take control of Congress
2. How can I measure this? What data do I need? What is the appropriate algorithm and why? How will I evaluate this?
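The ozone in a county is measured in parts per billion (ppb)
The data needed are ozone readings in ppb for each county
The mass-based formula is ppb = (grams of solute / grams of solution) * 10^9; for a gas such as ozone in air, ppb is normally expressed by volume (parts of ozone per billion parts of air)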
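As a rough illustration of the mass-ratio formula stated above and of the threshold comparison behind the dataset's columns, here is a minimal Python sketch. The function names are hypothetical, the sample reading is made up, and treating "meets the standard" as "reading is at or below the limit" is an assumption rather than something the dataset documents.

```python
# Limits taken from the notes: the current standard and the two proposed ones.
LIMITS_PPB = {"75 ppb": 75.0, "70 ppb": 70.0, "65 ppb": 65.0}


def ppb_by_mass(solute_grams: float, solution_grams: float) -> float:
    """Mass-based parts per billion: (solute / solution) * 10^9."""
    if solution_grams <= 0:
        raise ValueError("solution mass must be positive")
    return solute_grams / solution_grams * 1e9


def standards_met(reading_ppb: float) -> dict:
    """Map each limit label to True if the reading is at or below it.

    Assumption: a county meets a standard when reading <= limit.
    """
    return {label: reading_ppb <= limit for label, limit in LIMITS_PPB.items()}


if __name__ == "__main__":
    # 75 billionths of a gram in one gram of solution is roughly 75 ppb.
    print(round(ppb_by_mass(75e-9, 1.0), 6))
    # A made-up reading of 72 ppb, checked against each limit.
    print(standards_met(72.0))
```

A reading of 72 ppb would pass the current 75 ppb standard but fail both proposed limits, which is the kind of county the proposed rule would newly affect.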
3. What are the attributes/ variables/ features/ columns?
Columns (from left to right): County, State, Current ozone reading (ppb), Meets current standard, 70 ppb, 65 ppb
4. What are the domains of each feature?
5. What does each row/ observation represent?
Each row is a different county
6. How was/ is the data collected?
The data come from the EPA's most recent air quality data at the time (Nov. 2014)
7. What data are we ignoring, and does it have more features?
Population in each county (ozone ppb might depend on how densely populated a county is)
8. Is the data in a usable form? How does it need to be cleaned?
Data is available in CSV, JSON, XLS
It is organized from lowest ppb to highest ppb
The Meets current standard, 70 ppb, and 65 ppb columns all contain HTML tags that indicate whether or not the county meets each standard. They should be cleaned by replacing <i class="LATClose01"></i> with "no" and <i class="LATCheck" style="color:#009900;"></i> with "yes"
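The cleaning step described above could be sketched in Python as follows. The two tag strings are copied from the notes; the function names, the dictionary keys, and the sample row are hypothetical, since the real export's header names are not shown here.

```python
# Tag strings as they appear in the dataset's three standards columns.
CHECK_TAG = '<i class="LATCheck" style="color:#009900;"></i>'
CLOSE_TAG = '<i class="LATClose01"></i>'

# Hypothetical names for the three columns that hold the tags.
FLAG_COLUMNS = ("Meets current standard", "70 ppb", "65 ppb")


def clean_flag(cell: str) -> str:
    """Turn an HTML icon cell into "yes" or "no"; leave other text alone."""
    text = cell.strip()
    if text == CHECK_TAG:
        return "yes"
    if text == CLOSE_TAG:
        return "no"
    return text


def clean_row(row: dict) -> dict:
    """Return a copy of the row with the three flag columns cleaned."""
    cleaned = dict(row)
    for column in FLAG_COLUMNS:
        if column in cleaned:
            cleaned[column] = clean_flag(cleaned[column])
    return cleaned


if __name__ == "__main__":
    # Made-up example row, not taken from the real data.
    sample = {
        "County": "Example County",
        "State": "XX",
        "Meets current standard": CHECK_TAG,
        "70 ppb": CLOSE_TAG,
        "65 ppb": CLOSE_TAG,
    }
    print(clean_row(sample))
```

Matching the whole cell rather than a substring keeps unexpected values visible instead of silently mapping them to "yes" or "no".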
9. What metadata is available?
No metadata available for this specific dataset
10. Where did this data come from?
EPA Air Quality Data
11. What format is it in?
CSV, JSON, XLS
12. Who designed it and why?
EPA to provide researchers, public health professionals, and the public with air quality data
13. Who provided the data?
EPA
Graphical representation of data:
14. Is it titled and are the axes labeled with units?
It is titled "155 counties do not meet the current EPA standard"
Its single axis, "Ozone concentration compared to selected standards," is labeled in ppb. Shades of blue mark counties within the standards, and yellow/brown marks counties that do not meet the current EPA standard
15. Does it convey a main idea/argument clearly?
Yes, although the colors used for counties that do not meet the standard should be more attention-getting to better convey the urgency of the matter
16. Is this the best format for the data presentation?
Yes, counties are easy to find and identify whether they meet standards or not
17. Is it pleasing to look at?
Yes
18. What data is not represented on it?
Population per county
19. Background: Who are the stakeholders? Who cares about this question/data in any significant way? For each: What are their main tasks and goals?
Researchers
Main tasks: Research ozone ppb in counties and its harmful effects
Goals: Spread awareness of ozone ppb in counties and its harmful effects
Public health professionals
Main tasks: Analyze and develop programs that protect the health of individuals in a community
Goals: Push for stricter standards to be enforced for the population's wellbeing
Government officials
Main tasks: To ensure the health, safety, and wellbeing of the population
Goals: Take more action against counties that are not meeting standards
Public
Main tasks:
Goals: Bring attention to what they believe is not being handled correctly/is unfair
20. Catalysts: What would make this stakeholder get involved, or rally around a project? What do they care about?
All stakeholders care about health; if population numbers were included in the data, it could show how many people are exposed to dangerously high ozone ppb
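To make the idea above concrete, here is a minimal Python sketch of how county populations, if they were joined to the ozone readings, could be summed to estimate how many people live in counties above a given limit. The population field does not exist in the dataset under study, and every county name and number below is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass
class CountyReading:
    county: str
    state: str
    ozone_ppb: float
    population: int  # hypothetical field, not in the original dataset


def exposed_population(rows, limit_ppb: float) -> int:
    """Total population of counties whose reading exceeds the limit."""
    return sum(r.population for r in rows if r.ozone_ppb > limit_ppb)


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    rows = [
        CountyReading("County A", "XX", 62.0, 100_000),
        CountyReading("County B", "XX", 72.0, 250_000),
        CountyReading("County C", "XX", 80.0, 50_000),
    ]
    for limit in (75.0, 70.0, 65.0):
        print(limit, exposed_population(rows, limit))
```

Running the same sum for each candidate limit would show how many more people fall into non-compliant counties as the standard tightens.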
22. Benefits: How can data analytics improve the performance of different stakeholders?
In this specific case, data analytics can spread awareness of ozone's lung-damaging effects
It can also draw correlations between lung-related health issues in counties and their ozone ppb
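One simple way to quantify such a relationship is the Pearson correlation coefficient. The sketch below is only an assumption about how the analysis might start; the notes do not name a method. The health figures are placeholders, because no health data are part of this dataset.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values: ozone readings (ppb) and a made-up health rate.
    ozone_ppb = [60.0, 65.0, 70.0, 75.0, 80.0]
    health_rate = [4.1, 4.0, 4.6, 4.9, 5.3]
    print(pearson_r(ozone_ppb, health_rate))
```

A coefficient near 1 or -1 would indicate a strong linear association, but it would not by itself show that ozone causes the health outcome; factors such as population density would still need to be considered.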
Week 1: Read up on Data Analytics & Visualization, SPARQL, Big Data, etc. and research datasets
Datasets:
http://www3.epa.gov/ttn/analysis/ozone.htm
http://www.health.pa.gov/My%20Health/Environmental%20Health/Environmental%20Public%20Health%20Tracking/Pages/MetaData-Ozone.aspx#.VgfsmY9Viko
http://216.128.241.210/dataset?q=ozone+exceedance&sort=none&tags=ozone&ext_location=&ext_bbox=&ext_prev_extent=-142.03125%2C8.754794702435605%2C-59.0625%2C61.77312286453148
Technologies for Data Visualization:
http://bl.ocks.org/mbostock/2206590
http://bl.ocks.org/mbostock/4060606
http://bl.ocks.org/NPashaP/a74faf20b492ad377312
https://live2.zoomdata.com/zoomdata/visualization#53f22849e4b08f9d5f15360a-522655b0e4b00f4f3af30f12
https://vida.io/gists/FLFFovRPbu2t5QwQC
https://vida.io/gists/ot4Ynw4gZdmKkofo8
Below are slides from Professor Plotka's Web Science course.
Week 2: Data Studies
Data under study: http://spreadsheets.latimes.com/epa-tightens-regulations-ozone-pollution/
Related: http://www.latimes.com/nation/la-na-epa-ozone-rule-20141126-story.html#page=1, https://epafacts.com/high-cost-of-red-tape/smog/
Week 3: Data Visualization
http://cdb.io/1PwvKXN