Data Science

Trends in Atmospheric NO and NO2 Concentrations

As part of my Data Analysis in R course on Udacity, I’m publishing the results of an exploratory data analysis (EDA) I did on atmospheric nitric oxide (NO) and nitrogen dioxide (NO2) concentrations measured in Cambridge (UK).

You can find the datasets here.

I plotted the levels of NO in the atmosphere over the span of several days and noticed that they tend to follow a daily cycle. The levels appear to go down during the night and come back up during the day, possibly as it warms up:

[Figure: NO concentration against hours since start, shown over 200 hours and over 500 hours]
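
One way to check a suspected daily cycle is to fold the running hour counter onto a 24-hour clock and average the readings for each hour of the day. The sketch below does this on made-up data, not the Cambridge measurements; the `toy` data frame and its built-in mid-afternoon peak are assumptions for illustration, and it assumes the series starts at midnight so that `hour %% 24` matches the clock hour.

```r
# Synthetic stand-in for the real data: 10 days of hourly readings with a
# peak built in around 15:00 (these are NOT the Cambridge measurements).
set.seed(42)
hours <- 0:(24 * 10 - 1)
toy <- data.frame(
  hour = hours,
  NO   = 20 + 10 * sin(2 * pi * (hours %% 24 - 9) / 24) +
         rnorm(length(hours), sd = 2)
)

# Fold the running hour counter onto a 24-hour clock
# (assumes the first reading was taken at midnight).
toy$hour_of_day <- toy$hour %% 24

# Mean concentration for each hour of the day, averaged across all days.
daily_profile <- aggregate(NO ~ hour_of_day, data = toy, FUN = mean)

# Hours of the day with the highest and lowest average reading.
peak_hour   <- daily_profile$hour_of_day[which.max(daily_profile$NO)]
trough_hour <- daily_profile$hour_of_day[which.min(daily_profile$NO)]

print(daily_profile)
```

If the cycle is real, `daily_profile` should show one clear hump per day, and the peak and trough hours hint at what is driving it.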

Viewed on a larger scale, the mean levels of NO seem to have no noticeable trend:

[Figure: daily mean NO concentration against days since start, shown over 50 days and over 200 days]
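
A rough way to back up a “no trend” reading is to fit a straight line to the daily means and look at the slope. The sketch below runs on made-up data that is flat by construction; the column names `day` and `mean_NO` simply mirror the ones used in my code. Daily means from real sensors are usually correlated from one day to the next, so the p-value from a plain `lm()` fit should be taken as a hint rather than a formal test.

```r
# Synthetic daily means with no underlying trend (NOT the real data).
set.seed(1)
toy_daily <- data.frame(day = 0:99)
toy_daily$mean_NO <- 25 + rnorm(nrow(toy_daily), sd = 5)

# Ordinary least-squares fit of the daily mean against the day index.
fit <- lm(mean_NO ~ day, data = toy_daily)

# Estimated change in mean NO per day, and the p-value for that slope.
slope   <- coef(fit)[["day"]]
slope_p <- summary(fit)$coefficients["day", "Pr(>|t|)"]

print(c(slope = slope, p_value = slope_p))
```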

 

Plotting the levels of NO against the levels of NO2, a noticeable positive correlation emerges:

[Figure: NO2 concentration against NO concentration, jittered points with a smoothed fit]

Computing Pearson’s product-moment correlation between the NO and NO2 concentrations gives r = 0.6968, which supports the observation.

Here’s the R code that produced these plots, for those of you who are interested:


#http://www.airqualityengland.co.uk/local-authority/data?la_id=51

library(ggplot2)
library(dplyr)
library(grid)
library(gridExtra)

ds1 <- read.csv("2014-05-07-141107012512.csv")
ds2 <- read.csv("2014-08-05-141107012512.csv")
ds3 <- read.csv("2014-11-04-141107012512.csv")
dataset <- rbind.data.frame(ds1, ds2, ds3)


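# Combine date and time into seconds since the epoch;
# rows that fail to parse become NA and are dropped
dataset$timestamp <- as.numeric(strptime(paste(dataset$End.Date, dataset$End.Time), format = "%d/%m/%Y %H:00:00"))
dataset <- dataset[!is.na(dataset$timestamp), ]

# Hours elapsed since the first reading
dataset$hour <- dataset$timestamp / 3600
dataset$hour <- dataset$hour - min(dataset$hour)

# Whole days elapsed since the first reading; floor() puts the day boundaries
# at midnight (UTC), whereas round() would put them at midday
dataset$day <- floor(dataset$timestamp / 86400)
dataset$day <- dataset$day - min(dataset$day)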

# Hourly NO over the first 200 and 500 hours
# (uses `dataset`; no object called `cleanData` is ever created)
sp1 <- ggplot(aes(x = hour, y = NO), data = dataset) +
 ylim(c(0, 150)) +
 geom_line(color = "#334455") + 
 scale_x_continuous(breaks = seq(0, 200, 24), limits = c(0, 200)) +
 labs(x = "Hour Since Start", y = "Nitric Oxide Concentration")

sp2 <- ggplot(aes(x = hour, y = NO), data = dataset) +
 ylim(c(0, 150)) +
 geom_line(color = "#334455") + 
 scale_x_continuous(breaks = seq(0, 500, 24), limits = c(0, 500)) +
 labs(x = "Hour Since Start", y = "Nitric Oxide Concentration")

grid.arrange(sp1, sp2)



# Average NO per day; na.rm = TRUE skips any missing readings
dataset.by_day <- dataset %>%
 group_by(day) %>%
 summarise(mean_NO = mean(NO, na.rm = TRUE))

sp1 <- ggplot(aes(x = day, y = mean_NO), data = dataset.by_day) +
 ylim(c(0, 100)) +
 geom_line(color = "#334455") +
 scale_x_continuous(breaks = seq(0, 200, 7), limits = c(0, 50)) +
 labs(x = "Days Since Start", y = "Mean NO Concentration")

sp2 <- ggplot(aes(x = day, y = mean_NO), data = dataset.by_day) +
 ylim(c(0, 100)) +
 geom_line(color = "#334455") +
 scale_x_continuous(breaks = seq(0, 200, 7), limits = c(0, 200)) +
 labs(x = "Days Since Start", y = "Mean NO Concentration")

grid.arrange(sp1, sp2)




sp1 <- ggplot(aes(x = NO, y = NO2), data = dataset) + 
 xlim(c(0, 150)) + 
 ylim(c(0, 100)) +
 geom_point(alpha = 1/5, position = position_jitter(width = 0.8, height = 0.8)) +
 geom_smooth() +
 labs(x = "Nitric oxide concentration", y = "Nitrogen dioxide concentration")

grid.arrange(sp1)

with(dataset, cor.test(x = NO, y = NO2)) # 0.6968017
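
For anyone who wants to pull the coefficient out of the test object rather than read it off the printout, here is a small sketch. It uses made-up `no` and `no2` vectors, not the real measurements, and also recomputes r from its definition (covariance divided by the product of the standard deviations) to show that both routes agree.

```r
# Synthetic, positively related readings (NOT the Cambridge data).
set.seed(7)
no  <- runif(200, min = 0, max = 150)
no2 <- 10 + 0.4 * no + rnorm(200, sd = 15)

# cor.test() uses Pearson's method by default.
test   <- cor.test(x = no, y = no2)
r_test <- unname(test$estimate)   # the correlation coefficient
r_ci   <- test$conf.int           # 95% confidence interval for r

# The same quantity from the definition.
r_manual <- cov(no, no2) / (sd(no) * sd(no2))

# Share of variance in one variable linearly explained by the other.
r_squared <- r_test^2

print(c(r_test = r_test, r_manual = r_manual, r_squared = r_squared))
```

Applied to the value reported above, r = 0.6968 gives an r squared of roughly 0.49, so about half of the variance in one concentration is linearly associated with the other.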

Fuel Consumption Ratings for Canadian Vehicles in 2014

I’m currently following a course called Data Analysis in R on Udacity. Part of the course involves loading up a dataset found online, doing some exploratory data analysis, and publishing my results.

I used the dataset called “2014 – Fuel Consumption Ratings” available on this page:

http://data.gc.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

As per the description: “A yearly data set of all passenger vehicles sold in Canada based on their fuel-consumption ratings, estimated carbon-dioxide emissions and annual fuel costs.”

I used this command to load the dataset:

fuel <- read.csv("http://www.nrcan.gc.ca/sites/www.nrcan.gc.ca/files/oee/files/excel/MY2014%20Fuel%20Consumption%20Ratings.csv");

Here are a few commands to clean up the data frame:

names(fuel) <- c("Model.Year", "Manufacturer", "Model", "Vehicle.Class", "Engine.Size.L", "Cylinders", "Transmission", "Fuel.Type", "Fuel.Consumption.City.L.100km", "Fuel.Consumption.Hwy.L.100km", "Fuel.Consumption.City.Mpg", "Fuel.Consumption.Hwy.Mpg", "Fuel.Ly", "Co2.Emissions.g.km");
fuel <- fuel[2:1068, ];

Let’s load up the ol’ ggplot, as well as gridExtra for added flavour:

library(ggplot2);
library(grid);
library(gridExtra);
theme_set(theme_gray(base_size = 6));

First let’s coerce some columns from factors to numeric, to make plotting easier:

fuel[, "Fuel.Ly"]                         <- as.numeric(as.character(fuel[, "Fuel.Ly"]));
fuel[, "Engine.Size.L"]                   <- as.numeric(as.character(fuel[, "Engine.Size.L"]));
fuel[, "Co2.Emissions.g.km"]              <- as.numeric(as.character(fuel[, "Co2.Emissions.g.km"]));
fuel[, "Fuel.Consumption.City.L.100km"]   <- as.numeric(as.character(fuel[, "Fuel.Consumption.City.L.100km"]));
fuel[, "Fuel.Consumption.Hwy.L.100km"]    <- as.numeric(as.character(fuel[, "Fuel.Consumption.Hwy.L.100km"]));
fuel[, "Fuel.Consumption.City.Mpg"]       <- as.numeric(as.character(fuel[, "Fuel.Consumption.City.Mpg"]));
fuel[, "Fuel.Consumption.Hwy.Mpg"]        <- as.numeric(as.character(fuel[, "Fuel.Consumption.Hwy.Mpg"]));
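
The detour through `as.character()` matters: calling `as.numeric()` directly on a factor returns the internal level codes, not the numbers printed on screen. Here is a small sketch on a toy factor, followed by a shorter way to convert several columns at once. The `toy` data frame and the `numeric_cols` vector are made up for illustration. Since R 4.0, `read.csv()` no longer creates factors by default, in which case the same conversion still works on the resulting character columns.

```r
# A factor whose labels look like numbers.
f <- factor(c("3.5", "2.0", "5.7", "2.0"))

codes  <- as.numeric(f)                 # 2 1 3 1  (level codes, not values)
values <- as.numeric(as.character(f))   # 3.5 2.0 5.7 2.0

# Converting several columns in one go (toy data, hypothetical column list).
toy <- data.frame(
  Engine.Size.L = factor(c("3.5", "2.0")),
  Cylinders     = factor(c("6", "4")),
  Model         = factor(c("A", "B"))
)
numeric_cols <- c("Engine.Size.L", "Cylinders")
toy[numeric_cols] <- lapply(
  toy[numeric_cols],
  function(col) as.numeric(as.character(col))
)

str(toy)
```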

Let’s plot a few variables to get a feel for the data:

fuelLY <- qplot(
  data = fuel, 
  x = Fuel.Ly, 
  binwidth = 100,
  xlab = "Fuel (L/year)",
  ylab = "Count"
);

co2emiss <- qplot(
  data = fuel, 
  x = Co2.Emissions.g.km, 
  binwidth = 8,
  xlab = "CO2 Emissions (g/km)",
  ylab = "Count"
);

fuelconscity <- qplot(
  data = fuel, 
  x = Fuel.Consumption.City.L.100km, 
  binwidth = 0.5,
  xlab = "Fuel Consumption in City (L/100km)",
  ylab = "Count"
);

fuelconshwy <- qplot(
  data = fuel, 
  x = Fuel.Consumption.Hwy.L.100km, 
  binwidth = 0.3,
  xlab = "Fuel Consumption on Highway (L/100km)",
  ylab = "Count"
);

grid.arrange(fuelLY, co2emiss, fuelconscity, fuelconshwy, ncol = 2);

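[Figure: histograms of fuel use per year, CO2 emissions, city fuel consumption and highway fuel consumption]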

It looks like these variables (fuel per year, CO2 emissions, and fuel consumption per 100 km) are all roughly normally distributed, and they all have a slight skew towards the left. This is an interesting pattern: it could reflect car makers’ recent efforts to make models “greener”, pushing overall fuel efficiency up.
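
Skew can be put into a number instead of being judged by eye. Base R has no built-in skewness function, so the sketch below defines a simple moment-based one; `sample_skewness` is a helper written for this example and the two samples are made up. A positive value means a longer tail on the right (with the bulk of the values on the left), a negative value means a longer tail on the left, and values near zero suggest a symmetric shape.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Two synthetic samples for comparison (NOT the fuel data).
set.seed(3)
right_tailed <- rexp(1000)    # long tail on the right
symmetric    <- rnorm(1000)   # roughly symmetric

skew_right <- sample_skewness(right_tailed)
skew_sym   <- sample_skewness(symmetric)

print(c(right_tailed = skew_right, symmetric = skew_sym))
```

On the real data this would be called as, for example, `sample_skewness(fuel$Co2.Emissions.g.km)`.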

Let’s look at the median highway fuel consumption broken down by manufacturer:

fuel$Manufacturer <- factor(fuel$Manufacturer); #To reset the factors

firstPlot <- qplot(
    data = fuel, 
    x = Manufacturer,
    y = Fuel.Consumption.Hwy.L.100km, 
    binwidth = 0.3,
    xlab = "Manufacturer",
    ylab = "Fuel Consumption (L/100km)",
    geom= "boxplot"
) + coord_cartesian(xlim = c(0, 16.5));

secondPlot <- qplot(
    data = fuel, 
    x = Manufacturer,
    y = Fuel.Consumption.Hwy.L.100km, 
    binwidth = 0.3,
    xlab = "Manufacturer",
    ylab = "Fuel Consumption (L/100km)",
    geom= "boxplot"
) + coord_cartesian(xlim = c(16.5, 30.5));

thirdPlot <- qplot(
    data = fuel, 
    x = Manufacturer,
    y = Fuel.Consumption.Hwy.L.100km, 
    binwidth = 0.3,
    xlab = "Manufacturer",
    ylab = "Fuel Consumption (L/100km)",
    geom= "boxplot"
) + coord_cartesian(xlim = c(30.5, 40));

grid.arrange(firstPlot, secondPlot, thirdPlot, nrow = 3);

[Figures: three panels of boxplots of highway fuel consumption by manufacturer]

 

(I apologize for the small images. Please click on them to see higher resolution…)

There are many manufacturers relative to the number of 2014 models each one has, which makes the data a bit hard to read. Of the bigger manufacturers, Mini, Mazda and Honda have the most models with low fuel consumption, with medians near 6 L/100 km.
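
When the boxplots get crowded, a small summary table can be easier to read. The sketch below computes the median and the number of models per manufacturer, then sorts by median. It runs on a made-up `toy` data frame that borrows the column names from above; the manufacturers "A", "B" and "C" and their values are invented.

```r
# Toy stand-in for the fuel data (invented values).
toy <- data.frame(
  Manufacturer = c("A", "A", "A", "B", "B", "C"),
  Fuel.Consumption.Hwy.L.100km = c(6.0, 6.5, 7.0, 9.0, 10.0, 5.5)
)

# Median highway consumption per manufacturer.
by_make <- aggregate(
  Fuel.Consumption.Hwy.L.100km ~ Manufacturer,
  data = toy, FUN = median
)
names(by_make)[2] <- "median_hwy"

# Number of models behind each median, looked up by manufacturer name.
counts <- table(toy$Manufacturer)
by_make$n_models <- as.vector(counts[as.character(by_make$Manufacturer)])

# Most economical manufacturers first.
by_make <- by_make[order(by_make$median_hwy), ]
print(by_make)

# Reordering the factor by median makes a boxplot read from low to high.
toy$Manufacturer <- reorder(
  factor(toy$Manufacturer),
  toy$Fuel.Consumption.Hwy.L.100km,
  FUN = median
)
```

Keeping `n_models` next to the median makes it easier to tell a manufacturer with one frugal model apart from one with a whole frugal line-up.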

So this concludes my exploratory data analysis. Hope you enjoyed it.