Plotting Distributions

Accessing a CSV file.

surfaceTemp <- read.csv("average-monthly-surface-temperature.csv")

Cleaning data - even if the data you select is “perfect”, use the tools provided in R to demonstrate that you can manipulate contents within the data. This can include dealing with missing values, removing commas, etc.

# Drops every row that contains an NA in any column
surfaceTemp <- na.omit(surfaceTemp)
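
As a small illustration of the cleaning step, the sketch below uses a made-up data frame (the names `toy`, `year`, and `temp` and all values are hypothetical, not taken from the CSV). It shows how `gsub()` can strip commas before a numeric conversion and how `na.omit()` then drops every row that still contains an `NA`.

```r
# Toy data frame; names and values are invented for illustration only.
toy <- data.frame(
  year = c(2001, 2002, 2003, 2004),
  temp = c("1,234.5", "20.1", NA, "18.7"),
  stringsAsFactors = FALSE
)

# Remove commas, then convert the text column to numeric.
toy$temp <- as.numeric(gsub(",", "", toy$temp))

# na.omit() drops every row that has an NA in any column.
toyClean <- na.omit(toy)

print(nrow(toy))       # rows before cleaning
print(nrow(toyClean))  # rows after cleaning
```

Comparing the row counts before and after is a quick way to see how much data the cleaning step removed.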

Utilize the summary statistical functions provided to understand two of the variables in the file.

sort() - sorts the data in ascending or descending order
summary() - provides a collection of summary statistics
mean() - finds the arithmetic average (i.e., adds all of the values and divides by the number of values)
median() - finds the middle value when the data are ordered
mfv() - finds the most frequent value, also called the mode (requires the modeest package)
sd() - finds the standard deviation
max() - largest value in the variable
min() - smallest value in the variable
range() - lowest and highest values
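
To make these functions concrete, here is a minimal sketch on two invented vectors (`x` and `y` are hypothetical and unrelated to the temperature data). It also shows one pitfall: the second positional argument of `mean()` is `trim`, so several values must be combined with `c()` before averaging.

```r
# Toy vectors; the values are invented for illustration only.
x <- c(4, 8, 15, 16, 23, 42)
y <- c(1, 2, 3)

sort(x, decreasing = TRUE)  # descending order
summary(x)                  # min, quartiles, median, mean, max
sd(x)                       # sample standard deviation
range(x)                    # c(min, max), not max - min
diff(range(x))              # max - min as a single number

# Pitfall: mean(a, b) treats b as the 'trim' argument, not as more data.
# Combine values with c() to average them together.
mean(mean(x), mean(y))      # returns mean(x) only
mean(c(mean(x), mean(y)))   # average of the two means
```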

# lengths of data 

lengthData <- nrow(surfaceTemp)


#variable one

averageOne <- surfaceTemp$Average.surface.temperature

#variable two
averageTwo <- surfaceTemp$Average.surface.temperature.1


# summary

avgOneSummary <- summary(averageOne)
avgTwoSummary <- summary(averageTwo)

# averages

avgOne <- mean(averageOne)
avgTwo <- mean(averageTwo)

# mean(a, b) would treat b as the 'trim' argument, so combine with c()
avgAvg <- mean(c(avgOne, avgTwo))

#medians

medianTemp <- median(averageOne)
medianTempTwo <- median(averageTwo)

# Mode (mfv() comes from the modeest package)
library(modeest)

modeTemp <- mfv(averageOne)
modeTempTwo <- mfv(averageTwo)

rangeMode <- (max(modeTemp) - min(modeTemp))

# SD 
sdTemp <- sd(averageOne)
sdTempTwo <- sd(averageTwo)


# range of temps

maxAvg <- max(averageOne)
minAvg <- min(averageOne)
rangeAvg <- maxAvg - minAvg


# range of years

years <- surfaceTemp$year

oldestYear <- min(years)
latestYear <- max(years)

rangeYears <- latestYear - oldestYear

print(avgAvg)

## [1] 18.07207

print(avgOneSummary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -36.24   12.30   22.06   18.07   25.32   39.89

print(rangeYears)

## [1] 84

print(medianTemp)

## [1] 22.05579

print(modeTemp)

##  [1] 22.61807 22.98580 23.74204 24.02933 24.20334 24.34774 24.48684 24.67424
##  [9] 24.68173 24.81125 24.94287 24.95078 25.12210 25.20633 25.24775 25.29535
## [17] 25.33140 25.49304 25.57120 25.63913 25.87320 26.14557 26.26956 26.31015
## [25] 26.43381 26.45145 26.46195 26.59354 26.70111 26.76041 26.91000 26.91238
## [33] 27.15457 27.38791 27.67205
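
The long output of `print(modeTemp)` is presumably a result of ties: when several values share the highest frequency, `mfv()` returns all of them, and with continuous measurements most values occur only once or a few times. The sketch below imitates that tie behavior in base R with `table()`. The helper `allModes` and the sample vectors are hypothetical and are not how modeest is implemented.

```r
# Hypothetical helper: returns every value tied for the highest count.
# This imitates the tie behavior described above; it is not modeest's code.
allModes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

discrete   <- c(1, 2, 2, 3, 3, 3, 4)
continuous <- c(22.61, 22.98, 23.74, 24.02)  # invented, all distinct

allModes(discrete)    # a single mode
allModes(continuous)  # every value ties, so all are returned

# For continuous data, the peak of a kernel density estimate is one
# common alternative way to locate the "most typical" value.
d <- density(continuous)
d$x[which.max(d$y)]
```

For a continuous variable like temperature, the density peak or a binned mode is often easier to interpret than a list of tied raw values.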

Pick one variable to plot with a histogram. Demonstrate that you understand how to work with the “breaks” option by showing two different plots based on different break values. Explain whether you think your different plots make sense. Be sure to add the solid distribution line.

hist(averageOne,
     breaks = 100, 
     xlab = "Temperatures",
     main = "Histogram of the average surface temperatures (100 breaks)",
     ylab = "density",
     freq = FALSE)
lines(density(averageOne),
      col = "maroon",
      lwd = 4)

hist(averageOne,
     breaks = 500, 
     xlab = "Temperatures",
     main = "Histogram of the average surface temperatures (500 breaks)",
     ylab = "density",
     freq = FALSE)
lines(density(averageOne),
      col = "pink",
      lwd = 4)
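
One detail worth knowing about `breaks`: when it is a single number, `hist()` treats it as a suggestion and picks "pretty" break points, so the actual number of bins can differ from the number requested. The sketch below checks this on simulated data (the vector `fakeTemp` is a made-up stand-in, not the real column) using `plot = FALSE`, which returns the histogram object without drawing it.

```r
set.seed(1)                                  # reproducible fake data
fakeTemp <- rnorm(1000, mean = 18, sd = 12)  # stand-in for the real column

# plot = FALSE returns the histogram object without drawing it.
hCoarse <- hist(fakeTemp, breaks = 10,  plot = FALSE)
hFine   <- hist(fakeTemp, breaks = 100, plot = FALSE)

# Number of bins actually used (one fewer than the number of break points).
length(hCoarse$counts)
length(hFine$counts)

# To force exact bin edges, pass a vector of break points instead.
edges  <- seq(floor(min(fakeTemp)), ceiling(max(fakeTemp)), length.out = 21)
hExact <- hist(fakeTemp, breaks = edges, plot = FALSE)
length(hExact$counts)  # 20 bins, because 21 edges were supplied
```

Supplying a vector of edges is the more predictable choice when two plots need to be compared bin for bin.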

Using the summary statistics, discuss how the histogram plot complements your understanding of the results.

The summary statistics show a minimum of about -36 and a maximum of about 40, with a median near 22 degrees and a mean near 18. From those numbers alone, I expected most of the data to fall roughly between 15 and 25 degrees. The most frequent values all sit in a band of about 5 degrees, from roughly 22.6 to 27.7. The histogram complements this by showing that the data are skewed to the left: the bulk of the observations sits on the right side of the plot and a long tail stretches toward the cold values, which is consistent with the mean being lower than the median. This helps me see that most surface temperatures cluster around 22 degrees. Although the data include a low of -36, values that extreme are very rare.
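
The link between the mean, the median, and the direction of skew can be checked numerically. The sketch below uses an invented sample and a hypothetical helper `skewness` that computes the usual moment-based sample skewness, m3 / m2^(3/2). Neither is part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
skewness <- function(v) {
  dev <- v - mean(v)
  mean(dev^3) / mean(dev^2)^1.5
}

# Invented sample: most values near 20-27, plus a few very cold ones.
toyTemp <- c(-30, -5, 10, 20, 22, 23, 24, 25, 26, 27)

mean(toyTemp) < median(toyTemp)  # TRUE: the cold values drag the mean down
skewness(toyTemp)                # negative, i.e. a longer left tail
```

A mean below the median together with a negative skewness value is the numeric counterpart of a histogram whose tail extends toward low values.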

Vaishnavi Kanakamedala

2025-02-05